Abstract

The torus routing chip (TRC) is a self-timed chip that performs deadlock-free cut-through
routing in k-ary n-cube multiprocessor interconnection networks using a new method of
deadlock avoidance called virtual channels. A prototype TRC with byte-wide self-timed
communication channels achieved on first silicon a throughput of 64Mbits/s in each dimension,
about an order of magnitude better performance than the communication networks
used by machines such as the Caltech Cosmic Cube or Intel iPSC. The latency of the
cut-through routing of only 150ns per routing step largely eliminates message locality
considerations in the concurrent programs for such machines. The design and testing of the
TRC as a self-timed chip were no more difficult than they would have been for a synchronous
chip.
1  Introduction


Message-passing  concurrent  computers  such  as  the  Caltech  Cosmic  Cube  [13]  and  Intel  iPSC 
[6]  consist  of  many  processing  nodes  that  interact  by  sending  messages  over  communication 
channels  between  the  nodes.  We  designed  the  torus  routing  chip  (TRC)  as  a  building 
block to construct high-throughput, low-latency k-ary n-cube interconnection networks for
message-passing  concurrent  computers. 

The  TRC  is  a  self-timed  VLSI  circuit  that  provides  deadlock-free  packet  communications  in 
k-ary n-cube (torus) networks [12] with up to k = 256 processors in each dimension. While
intended primarily for n = 2-dimensional networks, the chips can be cascaded to handle
n-dimensional networks using $\lceil n/2 \rceil$ TRC chips at each processing node. A prototype TRC
has  been  laid  out,  fabricated,  and  tested. 

Even if only two dimensions are used, the TRC can be used to construct concurrent computers
with up to $2^{16}$ nodes. It would be very difficult to distribute a global clock over an array
of  this  size  [4].  To  avoid  this  problem,  the  TRC  is  entirely  self-timed  [11],  thus  permitting 
each  processing  node  to  operate  at  its  own  rate  with  no  need  for  global  synchronization. 
Synchronization,  when  required,  is  performed  by  arbiters  in  the  TRC. 

To  reduce  the  latency  of  communications  that  traverse  more  than  one  channel,  the  TRC 
uses  cut-through  [7]  routing  rather  than  store-and-forward  routing.  Instead  of  reading  an 
entire  packet  into  a  processing  node  before  starting  transmission  to  the  next  node,  the  TRC 
forwards  each  byte  of  the  packet  to  the  next  node  as  soon  as  it  arrives.  Cut-through  routing 
thus  results  in  a  message  latency  that  is  the  sum  of  two  terms,  one  of  which  depends  on  the 
message length, L, and the other of which depends on the number of communication channels
traversed,  D.  Store-and-forward  routing  gives  a  latency  that  depends  on  the  product  of  L 
and  D.  Another  advantage  of  cut-through  routing  is  that  communications  do  not  use  up  the 
memory  bandwidth  of  intermediate  nodes.  A  packet  does  not  interact  with  the  processor 
or  memory  of  intermediate  nodes  along  its  route.  Packets  remain  strictly  within  the  TRC 
network  until  they  reach  their  destination. 
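
The contrast between the two latency expressions can be made concrete with a small
calculation. The sketch below is illustrative only: the per-byte transfer time `t_byte_ns`
and the per-hop routing delay `t_hop_ns` are assumed parameters rather than measurements,
and contention in the network is ignored.

```python
def store_and_forward_latency(length, hops, t_byte_ns):
    """Each of the `hops` channels carries the whole packet before the next starts."""
    return hops * length * t_byte_ns


def cut_through_latency(length, hops, t_byte_ns, t_hop_ns):
    """The header pays a routing delay per hop; the body streams behind it."""
    return hops * t_hop_ns + length * t_byte_ns


# Hypothetical numbers: 100-byte packet, 10 channels, 125 ns per byte, 150 ns per hop.
L, D = 100, 10
sf = store_and_forward_latency(L, D, t_byte_ns=125)
ct = cut_through_latency(L, D, t_byte_ns=125, t_hop_ns=150)
print(sf, ct)  # 125000 14000
```

With these assumed values the store-and-forward figure grows with the product of L and D,
while the cut-through figure grows only with their weighted sum.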

The  TRC  uses  virtual  channels  to  perform  deadlock-free  routing  in  torus  networks.  By 
splitting  each  physical  channel  into  two  virtual  channels  and  making  routing  dependent 
on  the  virtual  channel  on  which  a  message  arrives,  the  TRC  converts  the  cycle  of  channel 
dependencies  in  each  dimension  into  a  spiral. 

This paper describes the considerations that went into the design of the TRC in a “top-down”
order that starts with a formal discussion of the deadlock problem in Section 2. We
develop  a  model  of  communications  in  multiprocessor  interconnection  networks  and  prove 
a  strong  theorem  about  deadlock.  Based  on  this  model,  the  concept  of  virtual  channels  is 
presented  in  Section  3.  Sections  4  and  5  present  the  design  of  the  TRC  at  the  system  and 
logical  levels.  Experimental  results  are  reviewed  in  Section  6. 


2  Deadlock-Free  Routing 


Deadlock in the interconnection network of a concurrent computer occurs when no message
can advance toward its destination because the queues of the message system are full [8].
Consider the example shown in Figure 1. The queues of each node in the 4-cycle are
filled with messages destined for the opposite node. No message can advance toward its
destination; thus the cycle is deadlocked. In this locked state, no communication can occur
over the deadlocked channels until exceptional action is taken to break the deadlock.

Figure 1: Deadlock in a 4-Cycle

The  technique  of  virtual  channels  allows  deadlock-free  routing  to  be  performed  in  any 
strongly-connected  interconnection  network  [2].  This  technique  involves  splitting  physical 
channels  on  cycles  into  multiple  virtual  channels  and  then  restricting  the  routing  so  the 
dependence  between  the  virtual  channels  is  acyclic. 

Definition  1  A  flow  control  digit  or  flit  is  the  smallest  unit  of  information  that  a  queue  or 
channel  can  accept  or  refuse.  Generally  a  packet  consists  of  many  flits.  Each  packet  carries 
its  own  routing  information. 

We  have  adopted  this  complication  of  standard  terminology  to  distinguish  between  those 
flow  control  units  that  always  include  routing  information  -  viz.  packets  -  and  those  lower 
level  flow  control  units  that  do  not  -  viz.  flits.  The  literature  on  computer  networks  [16]  has 
been  able  to  avoid  this  distinction  between  packets  and  flits  because  most  networks  include 
routing  information  with  every  flow  control  unit;  thus  the  flow  control  units  are  packets. 
That  is  not  the  case  in  the  interconnection  networks  used  by  message-passing  concurrent 
computers  such  as  the  Caltech  Cosmic  Cube  [13]. 

We  assume  the  following: 

1. Every packet arriving at its destination node is eventually consumed.

2. A node can generate packets destined for any other node.

3.  The  route  taken  by  a  packet  is  determined  only  by  its  destination,  and  not  by  other 
traffic  in  the  network. 

4.  A  node  can  generate  packets  of  arbitrary  length.  Packets  will  generally  be  longer  than 
a  single  flit. 

5.  Once  a  queue  accepts  the  first  flit  of  a  packet,  it  must  accept  the  remainder  of  the 
packet  before  accepting  any  flits  from  another  packet. 

6.  An  available  queue  may  arbitrate  between  packets  that  request  that  queue  space,  but 
may  not  choose  amongst  waiting  packets. 

7.  Nodes  can  produce  packets  at  any  rate  subject  to  the  constraint  of  available  queue 
space  (source  queued). 

The  following  definitions  develop  a  notation  for  describing  networks,  routing  functions,  and 
configurations. 


Definition 2 An interconnection network, $I$, is a strongly connected directed graph,
$I = G(N, C)$. The vertices of the graph, $N$, represent the set of processing nodes. The edges of
the graph, $C$, represent the set of communication channels. Associated with each channel,
$c_i$, is a queue with capacity $\mathrm{cap}(c_i)$. The source node of channel $c_i$ is denoted $s_i$ and the
destination node $d_i$.

Definition 3 A routing function $R : C \times N \to C$ maps the current channel, $c_c$, and
destination node, $n_d$, to the next channel, $c_n$, on the route from $c_c$ to $n_d$, $R(c_c, n_d) = c_n$. A
channel is not allowed to route to itself, $c_c \neq c_n$. Note that this definition restricts the
routing to be memoryless in the sense that a packet arriving on channel $c_c$ destined for $n_d$
has no memory of the route that brought it to $c_c$. However, this formulation of routing as a
function from $C \times N$ to $C$ has more memory than the conventional definition of routing as
a function from $N \times N$ to $C$. Making routing dependent on the current channel rather than
the current node allows us to develop the idea of channel dependence. Observe also that
the definition of $R$ precludes the route from being dependent on the presence or absence
of other traffic in the network. $R$ describes strictly deterministic and non-adaptive routing
functions.


Definition 4 A channel dependency graph, $D$, for a given interconnection network, $I$, and
routing function, $R$, is a directed graph, $D = G(C, E)$. The vertices of $D$ are the channels
of $I$. The edges of $D$ are the pairs of channels connected by $R$:

$$E = \{(c_i, c_j) \mid R(c_i, n) = c_j \text{ for some } n \in N\}. \tag{1}$$

Since channels are not allowed to route to themselves, there are no 1-cycles in $D$.

Definition 5 A configuration is an assignment of a subset of $N$ to each queue. The number
of flits in the queue for channel $c_i$ will be denoted $\mathrm{size}(c_i)$. If the queue for channel $c_i$ contains
a flit destined for node $n_d$, then $\mathrm{member}(n_d, c_i)$ is true. A configuration is legal if

$$\forall c_i \in C,\ \mathrm{size}(c_i) \le \mathrm{cap}(c_i). \tag{2}$$


Definition 6 A deadlocked configuration for a routing function, $R$, is a non-empty legal
configuration of channel queues such that

$$\forall c_i \in C,\ \bigl(\forall n \ni \mathrm{member}(n, c_i),\ n \neq d_i \text{ and } c_j = R(c_i, n) \Rightarrow \mathrm{size}(c_j) = \mathrm{cap}(c_j)\bigr). \tag{3}$$

In this configuration no flit is one step from its destination, and no flit can advance
because the queue for the next channel is full. A routing function, $R$, is deadlock-free on an
interconnection network, $I$, if no deadlocked configuration exists for that function on that
network.


Theorem  1  A  routing  function,  R,  for  an  interconnection  network,  I,  is  deadlock-free  iff 
there  are  no  cycles  in  the  channel  dependency  graph,  D. 


Proof:

$\Rightarrow$ Suppose a network has a cycle in $D$. Since there are no 1-cycles in $D$, this cycle must
be of length two or more. Thus one can construct a deadlocked configuration by filling the
queues of each channel in the cycle with flits destined for a node two channels away, where
the first channel of the route is along the cycle.

$\Leftarrow$ Suppose a network has no cycles in $D$. Since $D$ is acyclic one can assign a total order
to the channels of $C$ so that if $(c_i, c_j) \in E$ then $c_i > c_j$. Consider the least channel in this
order with a full queue, $c_l$. Every channel, $c_n$, that $c_l$ feeds is less than $c_l$, and thus does
not have a full queue. Thus, no flit in the queue for $c_l$ is blocked, and one does not have
deadlock. ∎
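
Theorem 1 reduces deadlock freedom to a graph property that is easy to check mechanically.
The following Python sketch builds the channel dependency graph of Definition 4 from a
routing function and tests it for cycles. The helper names and the example ring (a
unidirectional 4-cycle in which channel i is assumed to run from node i to node
(i + 1) mod 4) are illustrative assumptions, not notation taken from the paper.

```python
def dependency_graph(channels, nodes, route):
    """Edges (ci, cj) such that route(ci, n) == cj for some destination n.

    `route(c, n)` returns the next channel, or None when a packet on
    channel c has arrived at its destination n.
    """
    edges = {c: set() for c in channels}
    for c in channels:
        for n in nodes:
            nxt = route(c, n)
            if nxt is not None:
                edges[c].add(nxt)
    return edges


def has_cycle(edges):
    """Depth-first search for a back edge in the directed graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {c: WHITE for c in edges}

    def visit(c):
        colour[c] = GREY
        for nxt in edges[c]:
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[c] = BLACK
        return False

    return any(colour[c] == WHITE and visit(c) for c in edges)


K = 4  # assumed example: channel i runs from node i to node (i + 1) % K


def ring_route(c, dest):
    here = (c + 1) % K                      # node at the head of channel c
    return None if here == dest else here   # the next channel leaves `here`


ring = dependency_graph(range(K), range(K), ring_route)
print(has_cycle(ring))  # True: the plain ring has a dependency cycle
```

By the theorem, the cycle found here means that this routing function on the plain ring is
not deadlock-free.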


3  Virtual  Channels 


Now  that  we  have  established  this  if-and-only-if  relationship  between  deadlock  and  the 
cycles  in  the  channel  dependency  graph,  we  can  approach  the  problem  of  making  a  network 
deadlock-free  by  breaking  the  cycles.  We  can  break  such  cycles  by  splitting  each  physical 
channel  along  a  cycle  into  a  group  of  virtual  channels.  Each  group  of  virtual  channels  shares 
a  physical  communication  channel;  however,  each  virtual  channel  requires  its  own  queue. 

Consider for example the case of a unidirectional four-cycle shown in Figure 2A,
$N = \{n_0, \ldots, n_3\}$, $C = \{c_0, \ldots, c_3\}$. The interconnection graph $I$ is shown on the left and the
dependency graph $D$ is shown on the right. We pick channel $c_0$ to be the dividing channel
of the cycle and split each channel into high virtual channels, $c_{10}, \ldots, c_{13}$, and low virtual
channels, $c_{00}, \ldots, c_{03}$, as shown in Figure 2B.

Packets at a node numbered less than their destination node are routed on the high channels,
and packets at a node numbered greater than their destination node are routed on the low
channels. Channel $c_{00}$ is not used. We now have a total ordering of the virtual channels
according to their subscripts: $c_{13} > c_{12} > c_{11} > c_{10} > c_{03} > c_{02} > c_{01}$. Thus, there is
no cycle in $D$, and the routing function is deadlock-free. In [2] this technique is applied
to construct deadlock-free routing functions for k-ary n-cubes, cube-connected cycles, and
shuffle-exchange networks. In each case virtual channels are added to the network and the
routing is restricted to route packets in order of decreasing channel subscripts. In the next
two sections, the routing function for k-ary n-cubes is developed into a chip.

Figure 2: Breaking Deadlock with Virtual Channels (I: Interconnection Graph; D: Dependency Graph)
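
The four-cycle example can be checked by enumeration. The Python sketch below assumes a
particular labelling that the text does not spell out: virtual channel (v, i) leaves node i
and enters node (i - 1) mod 4, with v = 1 for the high channel and v = 0 for the low
channel. Under that assumption it applies the stated rule (high channel when the current
node number is less than the destination, low channel when it is greater) to every source
and destination pair and checks that each route visits channels in strictly decreasing
order.

```python
K = 4  # nodes n0..n3 on a unidirectional ring


def route(src, dst):
    """Sequence of virtual channels (v, i) used from src to dst."""
    used, here = [], src
    while here != dst:
        v = 1 if here < dst else 0   # high below the destination, low above it
        used.append((v, here))       # assumed: channel (v, i) leaves node i
        here = (here - 1) % K        # assumed: ring runs toward lower node numbers
    return used


all_routes = [route(s, d) for s in range(K) for d in range(K) if s != d]

# Every hop moves to a strictly smaller (v, i) pair, so no cycle is possible.
decreasing = all(a > b for r in all_routes for a, b in zip(r, r[1:]))
channels_used = {c for r in all_routes for c in r}

print(decreasing)               # True
print((0, 0) in channels_used)  # False: low channel 0 is never used
```

Because Python compares tuples lexicographically, the test `a > b` on (v, i) pairs
corresponds to the ordering by subscripts used in the text.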

Many  deadlock-free  routing  algorithms  have  been  developed  for  store-and-forward  computer 
communications  networks  [5].  These  algorithms  are  all  based  on  the  concept  of  a  structured 
buffer  pool.  The  packet  buffers  in  each  node  of  the  network  are  partitioned  into  classes,  and 
the  assignment  of  buffers  to  packets  is  restricted  to  define  a  partial  order  on  buffer  classes. 
The  structured  buffer  pool  method  has  in  common  with  the  virtual  channel  method  that 
both  prevent  deadlock  by  assigning  a  partial  order  to  resources.  The  two  methods  differ 
in  that  the  structured  buffer  pool  approach  restricts  the  assignment  of  buffers  to  packets 
while  the  virtual  channel  approach  restricts  the  routing  of  messages.  Either  method  can 
be  applied  to  store-and-forward  networks,  but  the  structured  buffer  pool  approach  is  not 
directly  applicable  to  cut-through  networks,  since  the  flits  of  a  packet  cannot  be  interleaved. 


4  System  Design


The torus routing chip (TRC) can be used to construct arbitrary k-ary n-cube interconnection
networks. Each TRC routes packets in two dimensions, and the chips are cascadable
as  shown  in  Figure  3  to  construct  networks  of  dimension  greater  than  two.  The  first  TRC 
in  each  node  routes  packets  in  the  first  two  dimensions  and  strips  off  their  address  bytes 
before  passing  them  to  the  second  TRC.  This  next  chip  then  treats  the  next  two  bytes  as 
addresses  in  the  next  two  dimensions  and  routes  packets  accordingly.  The  network  can  be 
extended  to  any  number  of  dimensions. 

A  block  diagram  of  a  two-dimensional  message-passing  concurrent  computer  constructed 
around  the  TRC  is  shown  in  Figure  4.  Each  node  consists  of  a  processor,  its  local  memory, 
and  a  TRC.  Each  TRC  in  the  torus  is  connected  to  its  processor  by  a  processor  input  channel 
and  a  processor  output  channel.  Connections  on  the  edges  of  the  torus  wrap  around  to  the 
opposite  edge.  One  can  avoid  the  long  end-around  connection  by  folding  the  torus,  as  shown 
in  Figure  5. 

A flit in the TRC is a byte whose 8 bits are transmitted in parallel. The X and Y channels
each consist of 8 data lines and 4 control lines. The 4 control lines are used for separate
request/acknowledge signal pairs for each of two virtual channels. The processor channels
are also 8 bits wide, but have only two control lines each.

The  packet  format  is  shown  in  Figure  6.  A  packet  begins  with  two  address  bytes.  The  bytes 
contain  the  relative  X  and  Y  addresses  of  the  destination  node.  The  relative  address  in  a 
given  direction,  say  X,  is  a  count  of  the  number  of  channels  that  must  be  traversed  in  the 
X  direction  to  reach  a  node  with  the  same  X  address  as  the  destination.  After  the  address 
comes  the  data  field  of  the  packet.  This  field  may  contain  any  number  of  non-zero  data 
bytes.  The  packet  is  terminated  by  a  zero  tail  byte.  Later  versions  of  the  TRC  may  use  an 
extra bit to tag the tail of a packet, and might also include error checking.

Figure 4: A Torus System

Figure 5: A Folded Torus System

Figure 6: Packet Format

Figure 7: Virtual Channel Protocol


The  TRC  network  routes  packets  first  in  the  X  direction,  then  in  the  Y  direction.  Packets 
are  routed  in  the  direction  of  decreasing  address,  decrementing  the  relative  address  at  each 
step.  When  the  relative  X  address  is  decremented  to  zero,  the  packet  has  reached  the  correct 
X  coordinate.  The  X  address  is  then  stripped  from  the  packet,  and  routing  is  initiated  in 
the  Y  dimension.  When  the  Y  address  is  decremented  to  zero,  the  packet  has  reached  the 
destination  node.  The  Y  address  is  then  stripped  from  the  packet,  and  the  data  and  tail 
bytes  are  delivered  to  the  node. 
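
As an illustration of the packet format and the X-then-Y routing order, the following
Python sketch builds a packet from relative addresses and lists the nodes it visits. It
assumes a k × k torus with nodes labelled by (x, y) coordinates and channels running
toward decreasing coordinates, so that the relative address is the source coordinate minus
the destination coordinate modulo k. The function names are hypothetical.

```python
def make_packet(src, dst, k, data):
    """Two relative-address bytes, non-zero data bytes, then a zero tail byte."""
    if any(not 0 < b < 256 for b in data):
        raise ValueError("data bytes must be non-zero and fit in one byte")
    rel_x = (src[0] - dst[0]) % k   # channels to traverse in X
    rel_y = (src[1] - dst[1]) % k   # channels to traverse in Y
    return [rel_x, rel_y, *data, 0]


def path(src, packet, k):
    """Nodes visited: all X hops first, then all Y hops."""
    x, y = src
    rel_x, rel_y = packet[0], packet[1]
    nodes = [(x, y)]
    for _ in range(rel_x):
        x = (x - 1) % k
        nodes.append((x, y))
    for _ in range(rel_y):
        y = (y - 1) % k
        nodes.append((x, y))
    return nodes


pkt = make_packet(src=(2, 3), dst=(14, 1), k=16, data=[0x41, 0x42])
print(pkt)                        # [4, 2, 65, 66, 0]
print(path((2, 3), pkt, 16)[-1])  # (14, 1)
```

The zero tail byte is why the data field may not contain zero bytes in this format.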

Each  of  the  X  and  Y  physical  channels  is  multiplexed  into  two  virtual  channels.  In  each 
dimension  packets  begin  on  virtual  channel  1.  A  packet  remains  on  virtual  channel  1  until 
it  reaches  its  destination  or  address  zero  in  the  direction  of  routing.  After  a  packet  crosses 
address  zero  it  is  routed  on  virtual  channel  0.  The  address  0  origin  of  the  torus  network  in 
X  and  Y  is  determined  by  two  input  pins  on  the  TRC.  The  effect  of  this  routing  algorithm 
is  to  break  the  channel  dependency  cycle  in  each  dimension  into  a  two-turn  spiral  similar 
to  that  shown  in  Figure  2.  Packets  enter  the  spiral  on  the  outside  turn  and  reach  the  inside 
turn  only  after  passing  through  address  zero. 
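
The virtual channel rule for one dimension can be sketched in a few lines of Python. The
code below assumes that a packet stays on virtual channel 1 up to and including the hop
that leaves node 0, and uses virtual channel 0 afterwards; the text does not pin down
exactly which hop counts as crossing address zero, so this boundary is an assumption.
Channels are labelled (v, i) for virtual channel v leaving node i toward node i - 1.

```python
def virtual_channels(src, dst, k):
    """Virtual channels (v, i) used in one dimension of a k-node ring."""
    used, here, crossed = [], src, False
    while here != dst:
        used.append((0 if crossed else 1, here))
        if here == 0:
            crossed = True        # assumed: the switch happens after leaving node 0
        here = (here - 1) % k     # routing runs toward decreasing addresses
    return used


k = 8
routes = [virtual_channels(s, d, k) for s in range(k) for d in range(k) if s != d]

# Channel labels strictly decrease along every route: a spiral, not a cycle.
spiral = all(a > b for r in routes for a, b in zip(r, r[1:]))
print(virtual_channels(2, 5, k))  # [(1, 2), (1, 1), (1, 0), (0, 7), (0, 6)]
print(spiral)                     # True
```

Under this assumption every route runs down the outer turn of the spiral and, if it passes
node 0, continues down the inner turn without ever returning to the outer one.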

Each virtual channel in the TRC uses the 2-cycle signaling convention shown in Figure 7.
Each virtual channel has its own request (R) and acknowledge (A) lines. When R = A,
the receiver is ready for the next flit (byte). To transfer information, the sender waits for
R = A, takes control of the data lines, places data on the data lines, toggles the R line, and
releases the data lines. The receiver samples data on each transition of the R line. When the
receiver is ready for the next byte, it toggles the A line.

The  protocol  allows  both  virtual  channels  to  have  requests  pending.  The  sending  end  does 
not  wait  for  any  action  from  the  receiver  before  releasing  the  channel.  Thus,  the  other 
virtual  channel  will  never  wait  longer  than  the  data  transmission  time  to  gain  access  to 
the  channel.  Since  a  virtual  channel  always  releases  the  physical  channel  after  transmitting 
each  byte,  the  arbitration  is  fair.  If  both  channels  are  always  ready,  they  will  alternate 
bytes  on  the  physical  channel. 
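
The handshake can be modelled in a few lines of Python. This is a behavioural sketch only:
the class and method names are invented for illustration, timing and arbitration for the
data lines are ignored, and the receiver's sampling of the data lines is modelled as an
immediate copy into a per-channel latch.

```python
class VirtualChannel:
    """Request/acknowledge pair of one virtual channel (2-cycle signaling)."""

    def __init__(self):
        self.r = 0  # request line, toggled by the sender
        self.a = 0  # acknowledge line, toggled by the receiver

    def ready(self):
        return self.r == self.a  # R = A: receiver can take another flit


class PhysicalChannel:
    """Shared data lines plus one R/A pair per virtual channel."""

    def __init__(self, n_virtual=2):
        self.data = None
        self.vcs = [VirtualChannel() for _ in range(n_virtual)]
        self.latched = [None] * n_virtual  # receiver-side latch per virtual channel

    def send(self, vc, byte):
        ch = self.vcs[vc]
        if not ch.ready():
            return False               # previous flit not yet acknowledged
        self.data = byte               # drive the data lines
        ch.r ^= 1                      # toggle R; the receiver samples on this edge
        self.latched[vc] = self.data
        self.data = None               # release the data lines without waiting for A
        return True

    def acknowledge(self, vc):
        ch = self.vcs[vc]
        assert not ch.ready(), "nothing pending on this virtual channel"
        ch.a ^= 1                      # toggle A: ready for the next flit
        return self.latched[vc]


link = PhysicalChannel()
print(link.send(1, 0x11))   # True: virtual channel 1 sends and releases the lines
print(link.send(1, 0x12))   # False: its previous byte is not yet acknowledged
print(link.send(0, 0x22))   # True: virtual channel 0 can use the lines meanwhile
print(link.acknowledge(1))  # 17
```

The second and third calls show the property described above: an unacknowledged byte on one
virtual channel blocks only that virtual channel, not the physical channel.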

Consider the example shown in Figure 8. Virtual channel X1 gains control of the physical
channel, transmits one byte of information, and releases the channel. Before this information
is acknowledged, channel X0 takes control of the channel and transmits two bytes of
information. Then X1, having by then been acknowledged, takes the channel again.
Figure 9: TRC Block Diagram


5  Logic  Design 

As  shown  in  Figure  9,  the  TRC  consists  of  five  input  controllers,  a  five  by  five  crossbar 
switch,  five  output  queues,  and  two  output  multiplexers.  There  is  one  input  controller  and 
one  output  controller  for  each  virtual  channel.  The  output  multiplexers  serve  to  multiplex 
two  virtual  channels  onto  a  single  physical  channel. 

The  input  controller  is  responsible  for  packet  routing.  When  a  packet  header  arrives,  the 
input controller selects the output channel, adjusts the header by decrementing and sometimes
stripping the byte, and then passes all bytes to the crossbar switch until the tail byte
is  detected. 

The input controller, shown in Figure 10, consists of a datapath and a self-timed state
machine. The datapath contains a latch, a zero checker, and a decrementer. A state
latch, logic array, and control logic comprise the state machine. When the request line
for the channel is toggled, data is latched, and the zero checker is enabled. When the
zero checker makes a decision, the logic array is enabled to determine the next state, the
selected crossbar channel, and whether to strip, decrement, or pass the current byte. When
the required operation has been completed, possibly requiring a round trip through the
crossbar, the state and selected channel are saved in cross-coupled multi-flops and the logic
array is precharged.

Figure 10: Input Controller Block Diagram
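
The header handling performed by the input controller can be summarised as a small
function. The sketch below is a behavioural model in Python, not the self-timed circuit: it
assumes the zero check is applied to the address byte as it arrives, with a zero byte
stripped and a non-zero byte decremented and forwarded, and it uses the hypothetical port
names "P", "X" and "Y" for the processor and the two routing dimensions.

```python
def input_controller(arrived_on, packet):
    """Return (output_port, forwarded_bytes) for one packet at one node.

    `packet` is a list of bytes: remaining address bytes, data, zero tail.
    """
    body = list(packet)
    if arrived_on in ("P", "X"):   # the X address is still at the head
        if body[0] != 0:
            body[0] -= 1           # decrement and keep routing in X
            return "X", body
        body.pop(0)                # X is done: strip the address byte
    if body[0] != 0:               # the Y address is now at the head
        body[0] -= 1               # decrement and keep routing in Y
        return "Y", body
    body.pop(0)                    # Y is done: strip and deliver to the processor
    return "P", body


# Follow one packet with relative address (2, 1) from injection to delivery.
port, pkt, hops = "P", [2, 1, 0x41, 0], 0
while True:
    port, pkt = input_controller(port, pkt)
    if port == "P":
        break
    hops += 1
print(hops, pkt)  # 3 [65, 0]
```

In this model the packet crosses two X channels and one Y channel, and only the data and
tail bytes reach the destination processor.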

The input controller and all other internal logic operate using a 4-cycle self-timed signaling
convention  [11].  One  function  of  the  state  machine  control  logic  is  to  convert  the  external 
2-cycle  signaling  convention  into  the  on-chip  4-cycle  signaling  convention.  The  signaling 
convention  is  converted  back  to  2-cycle  at  the  output  pads. 

The  crossbar  switch  performs  the  switching  and  arbitration  required  to  connect  the  five 
input  controllers  to  the  five  output  queues.  A  single  crosspoint  of  the  switch  is  shown  in 
Figure 11. A two-input interlock (mutual-exclusion) element in each crosspoint arbitrates
requests  from  the  current  input  channel  (row)  with  requests  from  all  lower  channels  (rows). 
The  interlock  elements  are  connected  in  a  priority  chain  so  that  an  input  channel  must  win 
the  arbitration  in  the  current  row  and  all  higher  rows  before  gaining  access  to  the  output 
channel  (column). 

The  output  queues  buffer  data  from  the  crossbar  switch  for  output.  The  queues  are  each  of 
length  four.  While  a  shorter  queue  would  suffice  to  decouple  input  and  output  timing,  the 
longer  queue  also  serves  to  smooth  out  the  variation  in  delays  due  to  channel  conflicts. 

Each  output  multiplexer  performs  arbitration  and  switching  for  the  virtual  channels  that 
share  a  common  physical  channel.  As  shown  in  Figure  12,  a  small  self-timed  state  machine 
sequences  the  events  of  placing  the  data  on  the  output  pads,  asserting  request,  and  removing 
the output data. An interlock element is used to resolve conflicts between channels for the
data  pads. 

To interface the on-chip equipotential region to the off-chip equipotential region that
connects adjacent chips, self-timed output pads (Figure 7.22 in [11]) are used. A Schmitt
trigger and exclusive-OR gate in each of these pads signal the state machine when the pad
is  finished  driving  the  output.  These  completion  signals  are  used  to  assure  that  the  data 
pads  are  valid  before  the  request  is  asserted  and  that  the  request  is  valid  before  the  data  is 
removed  from  the  pads  and  the  channel  released. 


6  Experimental  Results 


The  design  of  the  TRC  began  in  August  1985.  The  chip  was  completely  designed  and 
simulated  at  the  transistor  level  before  any  layout  was  performed.  The  circuit  design  was 
described  using  CNTK,  a  language  embedded  in  C  [3],  and  was  simulated  using  MOSSIM 
[1].  A  subtle  error  in  the  self-timed  controllers  was  discovered  at  the  circuit  level  before 
any  time-consuming  layout  was  performed.  Once  the  circuit  design  was  verified,  the  TRC 
was laid out in the new MOSIS scalable CMOS technology [17] using the Magic system
[10]. A second circuit description was generated from the artwork and six layout errors
were  discovered  by  simulation  of  the  extracted  circuit.  The  verified  layout  was  submitted 
to  MOSIS  for  fabrication  in  September  1985. 

The  first  batch  of  chips  was  completed  the  first  week  of  December  but  failed  to  function 
because  of  fabrication  errors.  A  second  run  of  chips  (same  design),  returned  the  second 
week  of  December,  contained  some  fully  functional  chips. 
Figure 12: Output Multiplexer Control

Performance  measurements  on  the  chips  are  shown  in  Figure  14.  To  measure  the  maximum 
channel  rate,  the  output  request  and  acknowledge  lines  were  tied  together,  and  the  input 
acknowledge  was  inverted  and  fed  back  into  input  request.  In  this  configuration  the  chip 
runs at a maximum speed, shown in Figure 14A, of ≈4MHz. This sluggish performance,
about  one  fifth  of  what  we  expected,  was  traced  to  an  overlooked  critical  path  in  the  input 
controller.  The  chip  still  functioned  correctly  thanks  to  the  self-timing. 

The delays from input request to output request and input acknowledge, shown in Figure 14B,
are 150ns and 250ns respectively. Data propagation time from input to output
(not  shown)  was  measured  to  be  60ns  for  both  rising  and  falling  edges.  Thus  data  is  set  up 
90ns  ahead  of  the  output  request.  Data  hold  time,  shown  in  Figure  14C,  is  20ns. 

Tau  model  calculations  suggest  that  a  redesigned  TRC  should  operate  at  20MHz  and  have 
an  input  to  output  delay  of  50ns.  The  redesign  will  involve  decoupling  the  timing  of  the 
input  controller  by  placing  single  stage  queues  between  the  input  pads  and  input  controller 
and  between  the  input  controller  and  the  crossbar  switch.  The  input  controller  will  be 
modified  to  speed  up  critical  paths. 


7  Conclusion 


This work was motivated by the ongoing design and implementation of experimental concurrent
computers at Caltech and the investigation [15] of interconnection networks for
these machines. A strong argument for a binary n-cube interconnection was the existence
of the e-cube algorithm [9] for deadlock-free packet routing. Until the development of virtual
channels, we knew of no comparable algorithm for cubes of higher arity. The TRC
demonstrates the use of virtual channels to provide deadlock-free packet routing in k-ary
n-cube multiprocessor communication networks.

Communication between nodes of a message-passing concurrent computer need not be
slower than the communication between the processor and memory of a conventional sequential
computer. By using byte-wide datapaths and cut-through routing, the TRC provides
node-to-node communication times that approach main memory access times of sequential
computers. Communications across the diameter of a network, however, require substantially
longer than a memory access time.

Figure 13: The Torus Routing Chip

Figure 14: TRC Performance Measurements

In spite of our past success in building machines using binary n-cube interconnection
networks, there are some compelling reasons to experiment with machines using a torus
network. First, the torus is easier to wire. Any network topology must be embedded in the
plane for implementation. The torus maps naturally into the plane with all wires the same
length; the cube maps into the plane in a less uniform way. Second, the torus more evenly
distributes load to communication channels. When a cube is embedded in the plane, a saturated
communication channel may run parallel to an idle channel. In the torus, by grouping
these channels together to make fewer but higher bandwidth channels, the saturated channel
can use all of the idle channel’s capacity.

Compare, for example, a 256-node binary 8-cube with a 256-node 16-ary 2-cube (16 × 16
torus) constructed with the same bisection width. If the 8-cube uses single-bit communication
channels, 256 wires will pass through a bisection of the cube, 128 in each direction.
Thus, with the same amount of wire we can construct a torus with 8-bit wide communication
channels. Assuming the channels operate at the same rate¹, by choosing the torus
network we trade a 4-fold increase in diameter (from 8 to 32) for an 8-fold increase in channel
throughput. In general, for an $N = 2^n$ node computer we trade a $2\sqrt{N}/n$ increase in diameter
for a $\sqrt{N}/2$ increase in channel throughput.
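
The arithmetic behind this comparison can be reproduced directly. The Python sketch below
assumes, as the example does, that the binary n-cube uses 1-bit channels, that the torus is
a √N × √N array of unidirectional rings so that a bisection cuts each ring in two places,
and that the torus diameter is taken as 2√N. The function name is hypothetical.

```python
import math


def cube_vs_torus(n):
    """Compare a binary n-cube with a sqrt(N) x sqrt(N) torus, N = 2**n (n even)."""
    N = 2 ** n
    side = math.isqrt(N)
    assert side * side == N, "n must be even so that sqrt(N) is an integer"

    cube_wires = N                  # N/2 one-bit channels in each direction
    torus_channels_cut = 2 * side   # each of `side` rings is cut in two places
    torus_width = cube_wires // torus_channels_cut  # bits per torus channel

    cube_diameter = n
    torus_diameter = 2 * side       # approximation used in the text
    return torus_width, torus_diameter / cube_diameter


width, diameter_ratio = cube_vs_torus(8)
print(width, diameter_ratio)  # 8 4.0
```

For n = 8 this gives the 8-bit channels and the 4-fold diameter increase quoted in the
example.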

We  plan  to  use  the  TRC  and  its  successors  in  future  experimental  concurrent  computers. 
Our  first  machine  will  use  the  TRC  along  with  commercial  microprocessors  and  memory 
parts to construct a 2-dimensional torus of several hundred processors. In 3 μm scalable
CMOS technology the TRC measures 4.5mm × 6.5mm with pads. After scaling to 1.6 μm
technology there will be room on a single die to combine both the TRC and a simple processor.
With  further  scaling  some  of  the  processor’s  local  memory  may  be  moved  on-chip. 

The  TRC  serves  as  still  another  counterexample  to  the  myth  that  self-timed  systems  are 
more  complex  than  synchronous  systems.  The  design  of  the  TRC  is  not  significantly  more 
complex  than  a  synchronous  design  that  performs  the  same  function.  As  for  speed,  the 
TRC  will  certainly  be  faster  than  a  synchronous  chip  since  each  chip  can  operate  at  its  full 
speed  with  no  danger  of  timing  errors.  A  synchronous  chip  is  generally  operated  at  a  slower 
speed  that  reflects  the  timing  of  a  worst-case  chip  and  adds  a  timing  margin. 

The  real  challenge  in  concurrent  computing  is  software.  The  development  of  concurrent 
software  is  strongly  influenced  by  available  concurrent  hardware.  We  hope  that  by  providing 
machines  with  higher  performance  internode  communication  we  will  encourage  concurrency 
to  be  exploited  at  a  finer  grain  size  in  both  system  and  application  software. 


¹ This assumption favors the cube since some of its channels are quite long while the torus channels are
uniformly short.